The data set used in this project is the Kaggle ML and Data Science Survey 2017. The survey results were stored in two data sets: (a) multiple-choice items and (b) free-response items. Kaggle stored each data set in CSV format. We downloaded the multiple-choice survey results in CSV format and placed them in our GitHub repo.
Importing Multiple Choice data
linkMC<-"https://raw.githubusercontent.com/betsyrosalen/DATA_607_Project_3/master/project3_master/rawdata/multipleChoiceResponses.csv"
#importing MC items
MC<-read_csv (linkMC)
dim(MC)
## [1] 16716 228
#lets create a unique ID variable
MC$id <- seq.int(nrow(MC))
Ignore this code. Importing the conversion rates data in case we want to use it in later analyses
# link_conversion<-"https://raw.githubusercontent.com/betsyrosalen/DATA_607_Project_3/master/project3_master/rawdata/conversionRates.csv"
# #importing MC items
# conversion<-read_csv (link_conversion)
# dim(conversion)
# #lets create a unique ID variable
# conversion$id <- seq.int(nrow(conversion))
This project will answer one global research question: Which are the most valued data science skills? The following six research questions will help answer this global question.
What is the relationship between the most popular platforms for learning DS and X (Niteen)? Alternatively phrased: What data science learning resources and which locations of open data are utilized by people of varying levels of education? (delete me if you need to!)
This section will describe the names of the variables and their labels (as reported in the schema doc) and how the values were coded (e.g., yes, no, select all).
Does a survey taker's formal education have any relationship to the ML/DS method he
or she is most excited about learning in the next year? (Binish)
To do the analysis, we concentrate on two columns in the dataset:
FormalEducation and MLMethodNextYearSelect
FormalEducation : Which level of formal education have you attained?
MLMethodNextYearSelect : Which ML/DS method are you most excited about learning
in the next year?
These questions were asked of all participants.
First we plot the distribution of formal education in the dataset.
The data set predominantly contains respondents with a Master's degree.
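The plotting code is not shown in this section. As a rough sketch of how such a distribution could be tabulated, here is a base R example that uses a small made-up vector `edu` in place of `MC$FormalEducation` (the values are illustrative, not survey data):

```r
# Toy stand-in for MC$FormalEducation (hypothetical values, not survey data)
edu <- c("Master's degree", "Bachelor's degree", "Master's degree",
         "Doctoral degree", "Master's degree", NA)

# Count each level, dropping missing responses, and sort from most to least common
edu_counts <- sort(table(edu, useNA = "no"), decreasing = TRUE)

# Convert counts to proportions of the non-missing responses
edu_share <- round(prop.table(edu_counts), 2)

print(edu_counts)
print(edu_share)

# barplot(edu_counts, las = 2) would draw the distribution as a bar chart
```

Base R is used here only so the sketch runs without extra packages; the same counts could be produced with `dplyr::count()` and plotted with `ggplot2`.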
Now let's look at the different ML/DS methods in the dataset
| ML Methods |
|---|
| Random Forests |
| Deep learning |
| Neural Nets |
| Text Mining |
| Genetic & Evolutionary Algorithms |
| Link Analysis |
| Rule Induction |
| Regression |
| Proprietary Algorithms |
| I don’t plan on learning a new ML/DS method |
| Ensemble Methods (e.g. boosting, bagging) |
| Factor Analysis |
| Social Network Analysis |
| Monte Carlo Methods |
| Time Series Analysis |
| Other |
| Bayesian Methods |
| Survival Analysis |
| MARS |
| Anomaly Detection |
| Cluster Analysis |
| Decision Trees |
| Association Rules |
| Uplift Modeling |
| Support Vector Machines (SVM) |
Now we can plot the distribution of ML/DS methods with formal education
| FormalEducation | MLMethodNextYearSelect | percentage |
|---|---|---|
| Bachelor | Deep learning | 0.40 |
| Bachelor | Neural Nets | 0.14 |
| Bachelor | Time Series Analysis | 0.06 |
| College Dropout | Deep learning | 0.37 |
| College Dropout | Neural Nets | 0.16 |
| College Dropout | Time Series Analysis | 0.07 |
| Doctoral | Deep learning | 0.44 |
| Doctoral | Neural Nets | 0.10 |
| Doctoral | Bayesian Methods | 0.06 |
| Doctoral | Time Series Analysis | 0.06 |
| Masters | Deep learning | 0.40 |
| Masters | Neural Nets | 0.12 |
| Masters | Time Series Analysis | 0.07 |
| Professional | Deep learning | 0.38 |
| Professional | Neural Nets | 0.14 |
| Professional | Time Series Analysis | 0.05 |
| High School | Deep learning | 0.39 |
| High School | Neural Nets | 0.14 |
| High School | Genetic & Evolutionary Algorithms | 0.10 |
Deep learning is the most popular ML/DS method in every category of formal education,
followed by Neural Nets. In every group except high school graduates, Time Series
Analysis is the third choice. High school graduates pick Genetic & Evolutionary
Algorithms as their third choice. Among doctoral survey takers, Bayesian Methods
ties with Time Series Analysis for third place (0.06 each).
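The shares in the table above are computed within each education level. The code that produced them is not shown, so the following is only a sketch of one way to get within-group shares and the top-ranked methods, using a made-up data frame `toy` rather than the survey data:

```r
# Toy stand-in for the two survey columns (hypothetical rows, not survey data)
toy <- data.frame(
  FormalEducation = c("Masters", "Masters", "Masters", "Masters",
                      "Bachelor", "Bachelor", "Bachelor"),
  MLMethodNextYearSelect = c("Deep learning", "Deep learning", "Neural Nets",
                             "Time Series Analysis",
                             "Deep learning", "Neural Nets", "Neural Nets"),
  stringsAsFactors = FALSE
)

# Cross-tabulate, then divide each row by its total so that the shares
# are computed within each education level
counts <- table(toy$FormalEducation, toy$MLMethodNextYearSelect)
shares <- prop.table(counts, margin = 1)

# For each education level, keep the names of the two highest-ranked methods
top_by_edu <- lapply(rownames(shares), function(level) {
  ranked <- sort(shares[level, ], decreasing = TRUE)
  names(ranked)[1:2]
})
names(top_by_edu) <- rownames(shares)

print(round(shares, 2))
print(top_by_edu)
```

Ties (such as the doctoral group's two methods at 0.06) need a deliberate rule, since a plain "top 3" cut would otherwise depend on sort order.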
What are the most frequently used DS methods? Where is the most time spent in terms of working with data? Do either of these correlate with job title or level of education? (Zach)
This section will describe the names of the variables and their labels (as reported in the schema doc) and how the values were coded (e.g., yes, no, select all).
Is there a difference between what ‘Learners’ think are the important skills to learn and what employed Data Scientists say are the skills and tools they are using? (Betsy)
This section will describe the names of the variables and their labels (as reported in the schema doc) and how the values were coded (e.g., yes, no, select all).
# Select only variables that seem most related to “Which are the most valued data science skills?”
# I May narrow down these columns even more later, but want to leave as much as possible for now
# Filter for US Only
USOnly <- MC %>%
select(-c(56:73, 76:79, 167:196, 198:228)) %>%
filter(Country=='United States')
# Separate those employed in Data Science from those who are not.
# Filter for Employed only, TitleFit better than 'Poorly', and CodeWriters only
# Remove those that said they are "Employed by a company that doesn't perform advanced analytics"
employed <- USOnly %>%
filter(!grepl('Not employed',EmploymentStatus),
TitleFit!="Poorly",
!grepl('doesn\'t perform advanced analytics',CurrentEmployerType),
CodeWriter=="Yes",
JobFunctionSelect != 'Build and/or run the data infrastructure')
# Filter for Data Science Learners who are not employed.
# The survey failed to capture those who are employed and ALSO students or learners!
# It did not ask employed respondents whether they were also studying data science.
learner <- USOnly %>%
filter(grepl('Not employed',EmploymentStatus),
grepl('Yes',LearningDataScience))
# Get rid of empty columns
employed <- remove_empty_cols(employed)
learner <- remove_empty_cols(learner)
glimpse(employed)
## Observations: 1,676
## Variables: 125
## $ GenderSelect <chr> "Male", "Male", "Ma...
## $ Country <chr> "United States", "U...
## $ Age <int> 35, 25, 33, NA, 35,...
## $ EmploymentStatus <chr> "Employed full-time...
## $ CodeWriter <chr> "Yes", "Yes", "Yes"...
## $ CurrentJobTitleSelect <chr> "Computer Scientist...
## $ TitleFit <chr> "Fine", "Fine", "Pe...
## $ CurrentEmployerType <chr> "Employed by govern...
## $ MLToolNextYearSelect <chr> "TensorFlow", "Amaz...
## $ MLMethodNextYearSelect <chr> "Text Mining", "Dee...
## $ LanguageRecommendationSelect <chr> "R", "Python", "Mat...
## $ PublicDatasetsSelect <chr> "Dataset aggregator...
## $ LearningPlatformSelect <chr> "Arxiv,Blogs,Kaggle...
## $ LearningPlatformUsefulnessArxiv <chr> "Somewhat useful", ...
## $ LearningPlatformUsefulnessBlogs <chr> "Somewhat useful", ...
## $ LearningPlatformUsefulnessCollege <chr> NA, "Very useful", ...
## $ LearningPlatformUsefulnessCompany <chr> NA, NA, NA, NA, NA,...
## $ LearningPlatformUsefulnessConferences <chr> NA, NA, "Not Useful...
## $ LearningPlatformUsefulnessFriends <chr> NA, NA, "Somewhat u...
## $ LearningPlatformUsefulnessKaggle <chr> "Somewhat useful", ...
## $ LearningPlatformUsefulnessNewsletters <chr> NA, NA, NA, NA, NA,...
## $ LearningPlatformUsefulnessCommunities <chr> NA, NA, NA, NA, NA,...
## $ LearningPlatformUsefulnessDocumentation <chr> NA, NA, NA, NA, NA,...
## $ LearningPlatformUsefulnessCourses <chr> NA, NA, NA, "Very u...
## $ LearningPlatformUsefulnessProjects <chr> "Somewhat useful", ...
## $ LearningPlatformUsefulnessPodcasts <chr> NA, NA, NA, NA, NA,...
## $ LearningPlatformUsefulnessSO <chr> NA, "Very useful", ...
## $ LearningPlatformUsefulnessTextbook <chr> "Very useful", "Som...
## $ LearningPlatformUsefulnessTradeBook <chr> NA, NA, NA, NA, NA,...
## $ LearningPlatformUsefulnessTutoring <chr> NA, "Very useful", ...
## $ LearningPlatformUsefulnessYouTube <chr> NA, NA, NA, NA, NA,...
## $ BlogsPodcastsNewslettersSelect <chr> NA, NA, NA, "KDnugg...
## $ DataScienceIdentitySelect <chr> "No", "Yes", "No", ...
## $ FormalEducation <chr> "Master's degree", ...
## $ UniversityImportance <chr> "Very important", "...
## $ JobFunctionSelect <chr> "Build and/or run t...
## $ WorkAlgorithmsSelect <chr> NA, "CNNs,Neural Ne...
## $ WorkToolsSelect <chr> "C/C++,Cloudera,Had...
## $ WorkToolsFrequencyAmazonML <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyAWS <chr> NA, NA, NA, "Often"...
## $ WorkToolsFrequencyAngoss <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyC <chr> "Sometimes", "Often...
## $ WorkToolsFrequencyCloudera <chr> "Most of the time",...
## $ WorkToolsFrequencyDataRobot <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyFlume <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyGCP <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyHadoop <chr> "Most of the time",...
## $ WorkToolsFrequencyIBMCognos <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyIBMSPSSModeler <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyIBMSPSSStatistics <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyIBMWatson <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyImpala <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyJava <chr> "Most of the time",...
## $ WorkToolsFrequencyJulia <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyJupyter <chr> NA, "Most of the ti...
## $ WorkToolsFrequencyKNIMECommercial <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyKNIMEFree <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyMathematica <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyMATLAB <chr> NA, NA, "Most of th...
## $ WorkToolsFrequencyAzure <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyExcel <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyMicrosoftRServer <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyMicrosoftSQL <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyMinitab <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyNoSQL <chr> "Most of the time",...
## $ WorkToolsFrequencyOracle <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyOrange <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyPerl <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyPython <chr> NA, "Most of the ti...
## $ WorkToolsFrequencyQlik <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyR <chr> "Sometimes", NA, NA...
## $ WorkToolsFrequencyRapidMinerCommercial <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyRapidMinerFree <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencySalfrod <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencySAPBusinessObjects <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencySASBase <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencySASEnterprise <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencySASJMP <chr> NA, NA, NA, NA, "Mo...
## $ WorkToolsFrequencySpark <chr> NA, NA, NA, "Someti...
## $ WorkToolsFrequencySQL <chr> NA, NA, NA, NA, "Of...
## $ WorkToolsFrequencyStan <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyStatistica <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyTableau <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencyTensorFlow <chr> NA, "Often", NA, NA...
## $ WorkToolsFrequencyTIBCO <chr> NA, NA, NA, NA, "So...
## $ WorkToolsFrequencyUnix <chr> "Most of the time",...
## $ WorkToolsFrequencySelect1 <chr> NA, NA, NA, NA, NA,...
## $ WorkToolsFrequencySelect2 <chr> NA, NA, NA, NA, NA,...
## $ WorkFrequencySelect3 <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsSelect <chr> "A/B Testing,Cross-...
## $ `WorkMethodsFrequencyA/B` <chr> "Sometimes", NA, NA...
## $ WorkMethodsFrequencyAssociationRules <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyBayesian <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyCNNs <chr> NA, "Most of the ti...
## $ WorkMethodsFrequencyCollaborativeFiltering <chr> NA, NA, NA, NA, NA,...
## $ `WorkMethodsFrequencyCross-Validation` <chr> "Sometimes", NA, "S...
## $ WorkMethodsFrequencyDataVisualization <chr> "Most of the time",...
## $ WorkMethodsFrequencyDecisionTrees <chr> "Sometimes", NA, NA...
## $ WorkMethodsFrequencyEnsembleMethods <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyEvolutionaryApproaches <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyGANs <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyGBM <chr> NA, NA, "Sometimes"...
## $ WorkMethodsFrequencyHMMs <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyKNN <chr> "Sometimes", NA, NA...
## $ WorkMethodsFrequencyLiftAnalysis <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyLogisticRegression <chr> "Sometimes", NA, NA...
## $ WorkMethodsFrequencyMLN <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyNaiveBayes <chr> "Sometimes", NA, NA...
## $ WorkMethodsFrequencyNLP <chr> "Most of the time",...
## $ WorkMethodsFrequencyNeuralNetworks <chr> NA, "Most of the ti...
## $ WorkMethodsFrequencyPCA <chr> NA, "Often", NA, NA...
## $ WorkMethodsFrequencyPrescriptiveModeling <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyRandomForests <chr> NA, NA, "Sometimes"...
## $ WorkMethodsFrequencyRecommenderSystems <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencyRNNs <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencySegmentation <chr> "Sometimes", NA, NA...
## $ WorkMethodsFrequencySimulation <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencySVMs <chr> "Sometimes", NA, NA...
## $ WorkMethodsFrequencyTextAnalysis <chr> "Most of the time",...
## $ WorkMethodsFrequencyTimeSeriesAnalysis <chr> "Often", NA, NA, NA...
## $ WorkMethodsFrequencySelect1 <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencySelect2 <chr> NA, NA, NA, NA, NA,...
## $ WorkMethodsFrequencySelect3 <chr> NA, NA, NA, NA, NA,...
## $ WorkDataVisualizations <chr> "26-50% of projects...
## $ id <int> 7, 22, 23, 25, 35, ...
glimpse(learner)
## Observations: 154
## Variables: 50
## $ GenderSelect <chr> "Female", "Male", "Non...
## $ Country <chr> "United States", "Unit...
## $ Age <int> 22, 47, 21, 27, 13, 23...
## $ EmploymentStatus <chr> "Not employed, and not...
## $ StudentStatus <chr> "Yes", "No", "Yes", "Y...
## $ LearningDataScience <chr> "Yes, but data science...
## $ MLToolNextYearSelect <chr> "SQL", "TensorFlow", "...
## $ MLMethodNextYearSelect <chr> "Deep learning", "Deep...
## $ LanguageRecommendationSelect <chr> "R", "Python", "Python...
## $ PublicDatasetsSelect <chr> "GitHub,Google Search,...
## $ LearningPlatformSelect <chr> "College/University,St...
## $ LearningPlatformUsefulnessArxiv <chr> NA, NA, NA, NA, NA, NA...
## $ LearningPlatformUsefulnessBlogs <chr> NA, "Somewhat useful",...
## $ LearningPlatformUsefulnessCollege <chr> "Very useful", NA, "So...
## $ LearningPlatformUsefulnessCompany <chr> NA, NA, NA, NA, NA, NA...
## $ LearningPlatformUsefulnessConferences <chr> NA, NA, NA, NA, NA, NA...
## $ LearningPlatformUsefulnessFriends <chr> NA, NA, "Very useful",...
## $ LearningPlatformUsefulnessKaggle <chr> NA, NA, "Somewhat usef...
## $ LearningPlatformUsefulnessNewsletters <chr> NA, "Somewhat useful",...
## $ LearningPlatformUsefulnessCommunities <chr> NA, NA, NA, NA, "Very ...
## $ LearningPlatformUsefulnessDocumentation <chr> NA, NA, "Not Useful", ...
## $ LearningPlatformUsefulnessCourses <chr> NA, "Very useful", "Ve...
## $ LearningPlatformUsefulnessProjects <chr> NA, NA, "Somewhat usef...
## $ LearningPlatformUsefulnessPodcasts <chr> NA, NA, NA, NA, NA, NA...
## $ LearningPlatformUsefulnessSO <chr> "Somewhat useful", NA,...
## $ LearningPlatformUsefulnessTextbook <chr> NA, NA, "Not Useful", ...
## $ LearningPlatformUsefulnessTutoring <chr> NA, NA, NA, NA, "Very ...
## $ LearningPlatformUsefulnessYouTube <chr> "Somewhat useful", NA,...
## $ BlogsPodcastsNewslettersSelect <chr> "Becoming a Data Scien...
## $ LearningDataScienceTime <chr> "< 1 year", "1-2 years...
## $ JobSkillImportanceBigData <chr> "Necessary", "Nice to ...
## $ JobSkillImportanceDegree <chr> "Nice to have", "Nice ...
## $ JobSkillImportanceStats <chr> "Nice to have", "Nice ...
## $ JobSkillImportanceEnterpriseTools <chr> "Nice to have", "Unnec...
## $ JobSkillImportancePython <chr> "Nice to have", "Neces...
## $ JobSkillImportanceR <chr> "Necessary", "Nice to ...
## $ JobSkillImportanceSQL <chr> "Nice to have", "Nice ...
## $ JobSkillImportanceKaggleRanking <chr> "Nice to have", "Nice ...
## $ JobSkillImportanceMOOC <chr> "Nice to have", "Nice ...
## $ JobSkillImportanceVisualizations <chr> "Nice to have", "Neces...
## $ JobSkillImportanceOtherSelect1 <chr> NA, NA, NA, NA, NA, NA...
## $ JobSkillImportanceOtherSelect2 <chr> NA, NA, NA, NA, NA, NA...
## $ JobSkillImportanceOtherSelect3 <chr> NA, NA, NA, NA, NA, NA...
## $ CoursePlatformSelect <chr> NA, "Coursera,Udacity"...
## $ HardwarePersonalProjectsSelect <chr> "Basic laptop (Macbook...
## $ TimeSpentStudying <chr> "2 - 10 hours", "2 - 1...
## $ ProveKnowledgeSelect <chr> "Experience from work ...
## $ DataScienceIdentitySelect <chr> "No", "Yes", "Sort of ...
## $ FormalEducation <chr> "Some college/universi...
## $ id <int> 44, 85, 208, 210, 212,...
# Change columns to factors
flevels <- function(x){
  factor(x, ordered=TRUE, levels=c("Not Useful", "Somewhat useful", "Very useful"))
}
flevels2 <- function(x){
  factor(x, ordered=TRUE, levels=c("Rarely", "Sometimes", "Often", "Most of the time"))
}
flevels3 <- function(x){
  factor(x, ordered=TRUE, levels=c("Unnecessary", "Nice to have", "Necessary"))
}
employed.LP <- employed[14:31] <- lapply(employed[14:31], flevels)
employed.WT <- employed[39:89] <- lapply(employed[39:89], flevels2)
employed.WM <- employed[91:120] <- lapply(employed[91:120], flevels2)
learner.LP <- learner[12:28] <- lapply(learner[12:28], flevels)
learner.JS <- learner[31:40] <- lapply(learner[31:40], flevels3)
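The helper functions above turn Likert-style answers into ordered factors. Here is a small sketch of what that makes possible, using made-up responses in `responses` and the same level order as `flevels2`:

```r
# Same level order as flevels2 above; the responses themselves are made up
freq_levels <- c("Rarely", "Sometimes", "Often", "Most of the time")
responses <- c("Often", NA, "Rarely", "Most of the time", "Sometimes", "Often")

freq <- factor(responses, levels = freq_levels, ordered = TRUE)

# Ordered factors support comparisons that follow the declared level order
at_least_often <- freq >= "Often"

# table() reports levels in their declared order rather than alphabetically
print(table(freq))
print(sum(at_least_often, na.rm = TRUE))
```

Any answer that is not spelled exactly like one of the declared levels becomes `NA`, so it is worth checking the raw values (including capitalization such as "Not Useful") before converting.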
# Take a peek at the demographics of those who are employed...
employed %>%
group_by(CurrentJobTitleSelect) %>%
summarise(total = n()) %>%
arrange(desc(total))
## # A tibble: 16 x 2
## CurrentJobTitleSelect total
## <chr> <int>
## 1 Data Scientist 559
## 2 Scientist/Researcher 185
## 3 Data Analyst 155
## 4 Software Developer/Software Engineer 151
## 5 Other 149
## 6 Researcher 110
## 7 Machine Learning Engineer 79
## 8 Statistician 62
## 9 Engineer 53
## 10 Business Analyst 45
## 11 Predictive Modeler 38
## 12 Computer Scientist 36
## 13 DBA/Database Engineer 17
## 14 Programmer 17
## 15 Operations Research Practitioner 13
## 16 Data Miner 7
# Need to Tidy this data so that each response is in a separate row rather than all in one
employed %>%
group_by(CurrentEmployerType) %>%
summarise(total = n()) %>%
arrange(desc(total))
## # A tibble: 34 x 2
## CurrentEmployerType total
## <chr> <int>
## 1 Employed by a company that performs advanced analytics 549
## 2 Employed by college or university 291
## 3 Employed by professional services/consulting firm 268
## 4 Employed by company that makes advanced analytic software 182
## 5 Self-employed 94
## 6 Employed by government 81
## 7 Employed by non-profit or NGO 64
## 8 Employed by company that makes advanced analytic software,Employ… 54
## 9 Employed by professional services/consulting firm,Employed by a … 19
## 10 Employed by professional services/consulting firm,Employed by co… 17
## # ... with 24 more rows
# Take a peek at the demographics of those who are learners...
learner %>%
group_by(StudentStatus, LearningDataScience) %>%
summarise(total = n()) %>%
arrange(desc(total))
## # A tibble: 4 x 3
## # Groups: StudentStatus [2]
## StudentStatus LearningDataScience total
## <chr> <chr> <int>
## 1 Yes Yes, I'm focused on learning mostly data science sk… 63
## 2 Yes Yes, but data science is a small part of what I'm f… 50
## 3 No Yes, I'm focused on learning mostly data science sk… 23
## 4 No Yes, but data science is a small part of what I'm f… 18
net_stacked(learner.LP)
net_stacked(learner.JS)
net_stacked(employed.LP)
net_stacked(employed.WT)
net_stacked(employed.WM)
Is there any interaction between the Kaggle survey takers’ program language use (R or Python) and their recommended program languages? (e.g. R users recommending R more than Python users recommending Python) (Burcu)
This section will describe the names of the variables and their labels (as reported in the schema doc) and how the values were coded (e.g., yes, no, select all).
dim(MC)
## [1] 16716 229
tb1<-MC %>%
select (id, WorkToolsSelect) %>%
filter (id %in% c(1:6))
datatable(tb1)
#removing NAs and empty values in column=WorkToolsSelect
df <- MC[!(MC$WorkToolsSelect == "" | is.na(MC$WorkToolsSelect)), ]
dim(df)
## [1] 7955 229
tb2<-df %>%
select (id, WorkToolsSelect) %>%
filter (id %in% c(1:6))
datatable(tb2)
#creating a new variable called work_tools where the original column values are split
#please note that this code will generate long data
df1<-df %>%
mutate(work_tools = strsplit(as.character(WorkToolsSelect), ",")) %>%
unnest(work_tools)
#check
tb3<-df1 %>%
select (id, WorkToolsSelect,work_tools) %>%
filter (id %in% c(1:3))
datatable(tb3)
df2<-df1 %>%
group_by(id, work_tools) %>%
summarize (total_count = n()) %>%
spread( work_tools, total_count, fill=0)
df3<-df2 %>%
mutate(lang_use = case_when (
(R==1 & Python==0) ~ "Using R Only",
(R==0 & Python==1) ~ "Using Python only",
(R==1 & Python==1) ~ "Using Both Python and R",
(R==0 & Python==0) ~ "Using Neither Python nor R"))%>%
select (id, R, Python, lang_use)
tb4<-df3 %>%
filter (id %in% c(1:10))
datatable(tb4)
#computing percentages
df4<-df3 %>%
group_by(lang_use) %>%
summarize (total_count = n()) %>%
mutate(percent = ((total_count / sum(total_count)) * 100), percent=round(percent, digits=2))
#checking
datatable(df4, colnames=c("Programming Language Survey takers use", "Count", "Percent"),class = 'cell-border stripe',caption = 'Table 1: Descriptive Statistics',options = list(pageLength = 2, dom = 'tip'))
p<-ggplot (df4, aes(x=lang_use,y=percent,fill=lang_use )) +
geom_bar(stat="identity", width =.5) +
labs (x="Language ", y="The distribution of R and Python among their users (%) " ,
title="Bar Graph of R and Python users") +
theme(axis.text.x = element_text(angle = 90)) +
scale_y_continuous (breaks=seq(0,100,10), limits = c(0,100))
ggplotly(p)
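The classification above depends on splitting the comma-separated `WorkToolsSelect` answers and checking for R and Python. The following base R sketch shows the same idea on a made-up data frame `toy`. It is not the `unnest()`/`spread()` route used above, just a compact alternative that yields the same four labels:

```r
# Toy stand-in for the id and WorkToolsSelect columns (hypothetical rows)
toy <- data.frame(
  id = 1:4,
  WorkToolsSelect = c("Python,R,SQL", "R", "Python,TensorFlow", "SQL,Excel"),
  stringsAsFactors = FALSE
)

# Split each comma-separated answer into a vector of tools
tools <- strsplit(toy$WorkToolsSelect, ",", fixed = TRUE)

# Flag whether each respondent listed R and/or Python
uses_r <- vapply(tools, function(x) "R" %in% x, logical(1))
uses_python <- vapply(tools, function(x) "Python" %in% x, logical(1))

# Combine the two flags into the same four labels used above
toy$lang_use <- ifelse(uses_r & uses_python, "Using Both Python and R",
                ifelse(uses_r, "Using R Only",
                ifelse(uses_python, "Using Python only",
                       "Using Neither Python nor R")))

print(toy[, c("id", "lang_use")])
```

Matching on whole tool names after the split avoids false hits from names that merely contain the letter "R".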
Let’s examine the above graph by LanguageRecommendationSelect
#check
tb5<-df1 %>%
select (id, WorkToolsSelect,work_tools, LanguageRecommendationSelect) %>%
filter (id %in% c(1:3))
datatable(tb5)
df5<-df1 %>%
group_by(id, work_tools,LanguageRecommendationSelect) %>%
summarize (total_count = n()) %>%
spread( work_tools, total_count, fill=0) %>%
mutate(lang_use = case_when (
(R==1 & Python==0) ~ "Using R Only",
(R==0 & Python==1) ~ "Using Python only",
(R==1 & Python==1) ~ "Using Both Python and R",
(R==0 & Python==0) ~ "Using Neither Python nor R"),
lang_rec = case_when (
(LanguageRecommendationSelect=="R") ~ "Recommending R ",
(LanguageRecommendationSelect=="Python" ) ~ "Recommending Python ",
is.na(LanguageRecommendationSelect) | trimws(LanguageRecommendationSelect) == "" ~ "Recommending Nothing",
TRUE ~ "Recommending Neither Python nor R"))%>%
select (id, R, Python, lang_use,lang_rec )
dim(df5)
## [1] 7955 5
tb6<-df5 %>%
filter (id %in% c(1:10))
datatable(tb6)
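One detail worth checking in a recode like this is how missing recommendations are handled. Comparing against the string `"NA"` does not detect a missing value, as this small sketch with a made-up vector `rec` shows:

```r
# Made-up recommendations, including one missing answer
rec <- c("R", "Python", "Julia", NA)

# Comparing with the string "NA" does not detect a missing value:
# the comparison itself is NA for the missing element
print(rec == "NA")

# is.na() identifies the missing element directly
print(is.na(rec))

# A recode that checks for missing values first, then the specific languages
lang_rec <- ifelse(is.na(rec), "Recommending Nothing",
            ifelse(rec == "R", "Recommending R",
            ifelse(rec == "Python", "Recommending Python",
                   "Recommending Neither Python nor R")))
print(lang_rec)
```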
#computing percentages
df6<-df5 %>%
group_by(lang_use,lang_rec) %>%
summarize (total_count = n()) %>%
mutate(percent = ((total_count / sum(total_count)) * 100), percent=round(percent, digits=2))
#checking
datatable(df6, colnames=c("Programming Language Survey takers use", "Count", "Percent"),class = 'cell-border stripe',caption = 'Table 1: Descriptive Statistics',options = list(pageLength = 2, dom = 'tip'))
p1<-ggplot (df6, aes(x=lang_use,y=percent,fill=lang_use )) +
geom_bar(stat="identity", width =.5) +
labs (x="Language ", y="The distribution of R and Python among their users (%) " ,
title="Bar Graph of R and Python users and their recommended language") +
theme(axis.text.x = element_text(angle = 90)) +
scale_y_continuous (breaks=seq(0,100,10), limits = c(0,100))+
facet_wrap(~lang_rec)+
theme(legend.position = 'none')
ggplotly(p1)
We found that a little under half of the survey takers (N=3540, 44.5%) reported using both R and Python. The take-home message for aspiring data scientists is that the largest group of Kaggle survey takers uses both languages, so both are widely used. Among the remaining respondents, a small portion (N=714, 8.98%) use neither Python nor R. The rest use either R or Python: 2533 (31.84%) indicated using only Python, while only 1168 (14.68%) reported using only R.
The story of this contentious R vs Python debate gets more interesting when we compare the languages respondents use with the languages they recommend. It is plausible to assume that Python users will recommend Python while R users will recommend R. Can we explore this hypothesis? How large is the difference between Python and R users in the languages they recommend?
Our results revealed that 72.17% of the Python-only users recommended Python while 53.77% of the R-only users recommended R. This result is consistent with our hypothesis. Since there are more Python-only users than R-only users in this sample, it makes sense to see differences in their recommendations. What is surprising, however, is the size of the gap in recommending the other language: 15.92% of the R users recommended Python, whereas only 1.42% of the Python users recommended R.
However, these results should be interpreted carefully because some survey takers did not make any recommendation. For instance, 18.87% of the Python users did not respond to this question. Similarly, 17.55% of R users did not give an opinion on their recommended language. If these Python users had made a recommendation, would that increase Python users' overall recommendation of R? We cannot tell from these data.
Since nearly half of the sample uses both R and Python, let's look at their recommendations: 51.72% of them recommend Python while 25.65% recommend R. In other words, about a quarter of those who use both languages recommend R rather than Python.
The moral of the story on this contentious debate on R vs Python
+ Nearly half of the sample uses both R and Python.
+ R-only and Python-only users are in roughly a 1:2 ratio.
+ More R users recommended Python than Python users recommended R.
+ Users of both languages recommend Python more often than R.
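The percentages quoted in this section are shares within each usage group (for example, the share of R-only users who recommend R), not shares of the whole sample. Here is a short base R sketch of that kind of row percentage, with made-up pairs in `toy`:

```r
# Made-up usage/recommendation pairs (not survey data)
toy <- data.frame(
  lang_use = c("Using R Only", "Using R Only", "Using R Only", "Using R Only",
               "Using Python only", "Using Python only"),
  lang_rec = c("R", "R", "R", "Python", "Python", "Python"),
  stringsAsFactors = FALSE
)

# Cross-tabulate usage (rows) against recommendation (columns)
counts <- table(toy$lang_use, toy$lang_rec)

# margin = 1 divides each row by its own total, so each usage group sums to 100%
row_pct <- round(100 * prop.table(counts, margin = 1), 2)
print(row_pct)
```

Using `margin = 2` instead would answer a different question: of those recommending a language, what share comes from each usage group.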
Of those receiving pay in US Dollars, is Python or R overall most profitable for a Kaggle survey taker? (Gabby)
This section will describe the names of the variables and their labels (as reported in the schema doc) and how the values were coded (e.g., yes, no, select all).
RQ6 <- MC %>%
mutate(work_tools = strsplit(as.character(WorkToolsSelect), ",")) %>%
unnest(work_tools)
RQ6 <- RQ6 %>%
filter(!is.na(WorkToolsSelect)) %>% # Filters out all rows with NA in the WorkToolsSelect column
filter(CompensationCurrency == "USD") %>% # Keeps only rows whose compensation currency is USD
filter(work_tools == "Python" | work_tools == "R") %>% # Keeps only the rows for the tools R or Python
select(id, work_tools, CompensationAmount) # Only three columns are needed from here on
RQ6_ids <- select(filter(as.data.frame(table(RQ6$id)), Freq == 1), Var1) # Only want people who use R or Python EXCLUSIVELY, not R and/or Python
RQ6_ids <- droplevels(RQ6_ids)$Var1 # Removed the levels so we can actually get the IDs
RQ6 <- filter(RQ6, id %in% RQ6_ids) # Only keep those rows whose id are inside of list of ids with R or Python exclusively used at work
RQ6 <- select(RQ6, -id) # No use for the ID anymore, it's done its job
RQ6$CompensationAmount <- gsub(",", "", RQ6$CompensationAmount) # Removed the commas from the compensation amount to prep for numeric transformation
RQ6$CompensationAmount <- as.numeric(RQ6$CompensationAmount) # made the column into a numeric for easier mathematical comparison and sorting
RQ6 <- filter(RQ6, CompensationAmount < 9999999) # Drops implausible values: entries of 9,999,999 or more (one dollar short of ten million) are treated as anomalies in the data set
rm(RQ6_ids) # remove the now-unused variable to save memory
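The cleaning steps above keep respondents who listed exactly one of R or Python, then turn the compensation strings into numbers. Here is a compact base R sketch of the same logic on a made-up long-format data frame `toy` (all values hypothetical):

```r
# Toy long-format data: one row per respondent per tool (hypothetical values)
toy <- data.frame(
  id = c(1, 1, 2, 3, 4),
  work_tools = c("Python", "R", "Python", "R", "R"),
  CompensationAmount = c("120,000", "120,000", "95,000", "80,000", "9,999,999"),
  stringsAsFactors = FALSE
)

# Keep respondents who appear exactly once, i.e. who listed only one of R or Python
rows_per_id <- table(toy$id)
single_ids <- as.numeric(names(rows_per_id)[rows_per_id == 1])
toy <- toy[toy$id %in% single_ids, ]

# Strip thousands separators and convert to numeric
toy$CompensationAmount <- as.numeric(gsub(",", "", toy$CompensationAmount, fixed = TRUE))

# Drop the implausibly large entry, mirroring the cutoff used above
toy <- toy[toy$CompensationAmount < 9999999, ]
print(toy)
```

Note that "exclusively" here means exclusive between R and Python; respondents may still use other tools such as SQL.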
RQ6_boxplot <- ggplot(RQ6) +
geom_boxplot( aes(x = factor(work_tools),
y = CompensationAmount,
fill = factor(work_tools)
)
) +
scale_y_continuous(breaks=seq(0,2000000,25000)) +
labs( x = "Programming Language",
y = "Annual Compensation in USD",
fill = "Programming Language")
RQ6_boxplot_ylim <- boxplot.stats(RQ6$CompensationAmount)$stats[c(1, 5)]
RQ6_boxplot <- RQ6_boxplot + coord_cartesian(ylim = RQ6_boxplot_ylim*1.05)
RQ6_boxplot
The average survey taker who used Python in their job made approximately $14,648.50 more than the average survey taker who used R. While R users overall had a higher base pay (about $5,000.00 more than their Python counterparts), their salary growth was noticeably more limited in comparison. Outliers aside, if the data collected are considered representative of the data science population, there is some indication that a prospective data scientist could learn R first for a higher initial salary and then learn Python to improve the chance of obtaining a job with more growth potential.
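The code behind the comparison of averages is not shown, and the text does not say whether "average" means the mean or the median. As a sketch under that uncertainty, both can be computed per language like this (the figures in `toy` are made up, not survey data):

```r
# Made-up compensation figures (not survey data)
toy <- data.frame(
  work_tools = c("Python", "Python", "Python", "R", "R", "R"),
  CompensationAmount = c(90000, 110000, 160000, 95000, 100000, 105000)
)

# Mean and median compensation for each language
pay_summary <- aggregate(CompensationAmount ~ work_tools, data = toy,
                         FUN = function(x) c(mean = mean(x), median = median(x)))
print(pay_summary)

# Difference in means (Python minus R)
means <- tapply(toy$CompensationAmount, toy$work_tools, mean)
print(means[["Python"]] - means[["R"]])
```

With skewed salary data the mean and median can differ a lot, so reporting both makes the comparison easier to interpret.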